Abstract

We consider the problem of jointly training structured models for extraction from sources whose instances
enjoy partial overlap. This has important applications like user-driven ad-hoc information extraction on
the web. Such applications present new challenges in terms of the number of sources and their arbitrary
pattern of overlap not seen by earlier collective training schemes applied on two sources. We present an
agreement-based learning framework and alternatives within it to trade off tractability, robustness to noise,
and extent of agreement. We provide a principled scheme to discover low-noise agreement sets in unlabeled
data across the sources. Through extensive experiments over 58 real datasets, we establish that our method
of additively rewarding agreement over maximal segments of text provides the best trade-offs, and also scores
over alternatives such as collective inference, staged training, and multi-view learning.
1 Introduction



This paper addresses the problem of training multiple structured prediction models that share an output space
but differ in their input data and feature space. Further, labeled data in each source is limited, but unlabeled
data over the different sources overlap partially. This scenario is applicable in many text modeling tasks such as
information extraction, dependency parsing, and word alignment. These tasks are increasingly being deployed
in settings where supervision is limited but redundancy is abundant. A concrete motivation for our work comes
from recent efforts to support rich forms of structured query-answering on the Web [3 H]. A typical subtask
here is building extraction models over multiple Web documents starting from a small seed of user-provided
structured records.

Recently, many learning paradigms have been proposed to exploit the relatedness of multiple models. On
one end of the spectrum we have collective inference [21] |3l [El 111 [12], where each model is trained independently
but prediction happens jointly to encourage agreement on overlapping content. On the other end are methods
like multi-view learning [11] [3] and agreement-based learning [17] [18] that formulate a single objective to jointly
train all models. Then there are methods in between that train models sequentially or alternately [2^, ^U. Our
problem is different from traditional multi-view learning, where multiple models are trained on different views
of a single data source. However, by treating the different contexts in which each shared portion resides as a
different view, we can apply multi-view learning to this problem. We elaborate on this and other alternatives in
Section 4.

In agreement-based learning [17] [18] the goal is to train multiple models so as to maximize the likelihood of
the labels agreeing on shared variables. However, these methods assume that all models need to agree on the same
set of variables; this trivially holds for two sources, where these methods have been applied. In our application the
number of sources is often as large as 20. As the number of sources increases, there is a bewildering number of ways
in which they overlap. This makes it challenging to devise objectives that maximally exploit the overlap while
accounting for noise in the agreement set and intractability of training. We are aware of no study where such
issues are addressed in the context of jointly training more than two sources with partial overlap.

In this paper we propose an agreement-based model for training multiple structured models with arbitrary 
partial overlap among the sources. We propose several alternatives for enforcing agreement ranging from singleton 
variables, to groups of contiguous variables, to global models that lead to giant agreement graphs. For the task 
of information extraction, we present a strategy for selecting the unit of agreement that leads to a significant 
reduction in the noise in the agreement set compared to the existing naive approach for choosing agreement 
sets. We present an extensive evaluation on 58 real-life collective extraction tasks covering a rich spectrum of 
Figure 1: (a) Three sample sentences with A = {(Matthew Matt Groening), (Matt Groening), (Matt Groening,
The Simpsons), (Simpsons)} (b) The fused graph (c)-(f) Clique Agreement approximation (g)-(i) Instance Pair
approximation.

data characteristics. This study reveals that agreement objectives that are additive over smaller components 
provide the best accuracy because of robustness against noise in the agreement term, while providing a tractable 
inference objective. 

2 Collective training 

Our goal is to collectively train $S$ structured prediction models, where every source $s \in S$ comes with a small
set $L_s$ of labeled instances and many unlabeled instances $U_s$. The unlabeled instances from different sources
share overlapping content and we seek to exploit this for better training. For extraction tasks, this overlap can
be anywhere from the level of unigrams to non-contiguous segments, as illustrated in Figure 1(a). An overlap
finding algorithm identifies such shared parts as an agreement set $A$ comprising a set of cliques. Each clique
$C \in A$ contains a list of triples $(s, i, r)$ indicating the source $s$, the instance $i \in s$, and the part $r \in i$ which has the
same content as the other members of the clique. We do not make any assumption of mutual exclusion between
cliques and in general each clique can span a variable number of members and token positions. In Section 3 we
present our strategy for computing the agreement set.

As in standard structured learning, we define a probabilistic model $P_s(y \mid x)$ for each source $s$ using its feature
vector $f_s(x, y)$ and parameters $w_s$ as:

$$P_s(y \mid x, w_s) = \frac{1}{Z(x, w_s)} \exp\big(w_s \cdot f_s(x, y)\big) \tag{1}$$



The traditional goal of training is to find a $w_s$ that maximizes the regularized likelihood of the labeled set in
source $s$:

$$LL_s(L_s, w_s) = \sum_{(x, y) \in L_s} \log P_s(y \mid x, w_s) - \gamma R(w_s) \tag{2}$$

where $\gamma R(w_s)$ is the regularization term.

We propose to augment this base objective with the likelihood that the S models agree on the labels of cliques 
in the agreement set A. We first observe that the joint distribution over the labels Y of all instances X spanning 
all sources is: 

$$P(Y \mid X, W) = \prod_{s} \prod_{i} P_s(Y_{s,i} \mid X_{s,i}, w_s) \tag{3}$$

where $W$ denotes $(w_1, \ldots, w_S)$, $X_{s,i}$ represents instance $i$ in source $s$, and $Y_{s,i}$ is a random variable for the
structured output of this instance.

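Now given an agreement set $A$, consider the subset $\mathcal{Y}_A$ of all possible labelings that are consistent with it:

$$\mathcal{Y}_A = \{Y : \forall C \in A,\ \forall (s, i, r), (s', i', r') \in C : Y_{s,i,r} = Y_{s',i',r'}\} \tag{4}$$

The log likelihood of the agreement set is then:

$$LL(\mathcal{Y}_A, W) = \log \Pr(\mathcal{Y}_A) = \log \sum_{Y \in \mathcal{Y}_A} \prod_{s,i} P_s(Y_{s,i} \mid X_{s,i}, w_s) \tag{5}$$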
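
To make Equation 5 concrete, the following minimal sketch computes the agreement likelihood by brute-force enumeration. It is only feasible for toy problems and is not the training procedure of this paper. The function name `agreement_log_likelihood`, the dictionary-based representation of each per-instance distribution, and the toy numbers are illustrative assumptions.

```python
import itertools
import math


def agreement_log_likelihood(dists, cliques):
    """Brute-force log Pr(Y_A) as in Equation 5 (toy sizes only).

    dists:   dict mapping (s, i) -> {label_tuple: probability}, i.e. the
             full distribution P_s(Y_{s,i} | X_{s,i}) of each instance.
    cliques: list of cliques; a clique is a list of (s, i, r) triples,
             where r is a tuple of token positions inside instance (s, i).
    """
    keys = sorted(dists)
    total = 0.0
    # Enumerate every joint labeling of all instances.
    for labeling in itertools.product(*(dists[k].items() for k in keys)):
        joint = dict(zip(keys, labeling))  # (s, i) -> (label_tuple, prob)
        consistent = True
        for clique in cliques:
            # Labels that each clique member assigns to its shared part.
            parts = {tuple(joint[(s, i)][0][p] for p in r) for s, i, r in clique}
            if len(parts) > 1:
                consistent = False
                break
        if consistent:
            total += math.prod(prob for _, prob in joint.values())
    return math.log(total)


# Two single-token instances from two sources that share their only token.
toy = {
    (0, 0): {("A",): 0.9, ("B",): 0.1},
    (1, 0): {("A",): 0.8, ("B",): 0.2},
}
shared = [[(0, 0, (0,)), (1, 0, (0,))]]
# Agreement mass is 0.9 * 0.8 + 0.1 * 0.2, i.e. approximately 0.74.
print(math.exp(agreement_log_likelihood(toy, shared)))
```

The enumeration is exponential in the number of instances, which is why the rest of this section works with the fused graph and its decompositions instead.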

In the rest of the paper we use the short form $(s, i)$ to denote the instance $i$ in source $s$. Our goal now is to jointly
train $w_1, \ldots, w_S$ so as to maximize a weighted combination of the likelihoods of the labeled and agreement sets:

$$\max_{w_1, \ldots, w_S} \ \sum_{s} LL_s(L_s, w_s) + \lambda\, LL(\mathcal{Y}_A, W) \tag{6}$$



2.1 Computing $LL(\mathcal{Y}_A, W)$

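Using Equations 1, 3, and 5 we rewrite $LL(\mathcal{Y}_A, W)$ as

$$LL(\mathcal{Y}_A, W) = \log \sum_{Y \in \mathcal{Y}_A} \exp\Big(\sum_{s,i} w_s \cdot f_s(X_{s,i}, Y_{s,i})\Big) - \sum_{s,i} \log Z(X_{s,i}, w_s) \tag{7}$$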

The second part in this equation is the sum of the log partition function over individual instances which can be 
computed efficiently as long as the base models are tractable. 
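
As an illustration of why the per-instance log partition functions are cheap for chain models, here is a minimal sketch of the standard forward algorithm in log space. The score representation (`node_scores`, `edge_scores`) is an assumption made for this example; in the paper's setting the scores would come from $w_s \cdot f_s$.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def chain_log_partition(node_scores, edge_scores):
    """log Z of a linear-chain model via the forward recursion.

    node_scores: list over token positions of {label: score}.
    edge_scores: {(prev_label, label): score}; missing pairs score 0.
    """
    labels = list(node_scores[0])
    # alpha[y] = log of the summed scores of all prefixes ending in label y.
    alpha = {y: node_scores[0][y] for y in labels}
    for scores in node_scores[1:]:
        alpha = {
            y: scores[y]
            + logsumexp([alpha[yp] + edge_scores.get((yp, y), 0.0) for yp in labels])
            for y in labels
        }
    return logsumexp(list(alpha.values()))


# With all scores zero, Z counts labelings: 2 labels over 3 positions gives 8.
uniform = [{"A": 0.0, "B": 0.0}] * 3
print(math.exp(chain_log_partition(uniform, {})))
```

The cost is linear in the sequence length and quadratic in the number of labels, in contrast to the exponential enumeration implied by a naive reading of the first term of Equation 7.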

The first part is equal to the log partition function of a fused graphical model $G_A$ constructed as follows:
Initially, each instance $(s, i)$ creates a graph $G_{si}$ corresponding to its model $P_s(\cdot)$. For text tasks, this would
typically be a chain model with a node for each token position. Next, for each clique $C \in A$, and for each pair
of triples $(s, i, r), (s', i', r') \in C$, we collapse the nodes of $r$ in $G_{si}$ with the corresponding nodes of $r'$ in $G_{s'i'}$. In
Figure 1(b) we show an example of such a fused graph created from the three instances of Figure 1(a) with four
cliques in their agreement set.

Let $K$ be the number of nodes in the final fused graph and $z_1, \ldots, z_K$ denote the node variables. Every
node $j$ in the initial graph $G_{si}$ is now mapped to some final node $k \in \{1, \ldots, K\}$, and we denote this mapping by
$\pi(s, i, j)$. The log-potential for a component $c$ in the fused graph is simply an aggregate of the log-potentials of
the members $c'$ that collapsed onto it.

$$\theta_c(z_c) = \sum_{(s, i, c') : \pi(s, i, c') = c} w_s \cdot f_s(X_{s,i}, z_c, c') \tag{8}$$

where we extend $\pi$ to operate on node-sets as well. The above $\theta$ parameters now define a distribution over the
fused variables $z_1, \ldots, z_K$ as follows:

$$P_A(z \mid \theta) = \frac{1}{Z_A(\theta)} \exp\Big(\sum_{c} \theta_c(z_c)\Big) \tag{9}$$

It is easy to see that the log partition function of this distribution is the same as the first term of Equation 7, so
we can work with $G_A$ instead. If the set of cliques in $A$ is such that the fused graph $G_A$ has a small tree width,
we can compute the log partition function $\log Z_A(\theta)$ efficiently. In other cases, we need to approximate the term
in various ways. We discuss several such approximations in Section 2.3.
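
The node-collapsing step that builds the fused graph can be sketched with a union-find structure. This is one possible way to compute the mapping $\pi$, not necessarily the authors' implementation; the function name `fuse_nodes` and the input layout are assumptions for illustration. Merging every clique member with the first member is equivalent to merging all pairs, since node identity is transitive.

```python
def fuse_nodes(instance_lengths, cliques):
    """Compute pi: (s, i, j) -> fused node index in 0..K-1.

    instance_lengths: dict (s, i) -> number of token positions.
    cliques: list of cliques; each is a list of (s, i, r) triples whose
             position tuples r all have the same length.
    """
    parent = {
        (s, i, j): (s, i, j)
        for (s, i), n in instance_lengths.items()
        for j in range(n)
    }

    def find(u):
        # Path-halving find.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for clique in cliques:
        s0, i0, r0 = clique[0]
        for s, i, r in clique[1:]:
            # Collapse corresponding positions of the two parts.
            for j0, j in zip(r0, r):
                a, b = find((s0, i0, j0)), find((s, i, j))
                if a != b:
                    parent[b] = a

    roots, pi = {}, {}
    for node in sorted(parent):
        root = find(node)
        pi[node] = roots.setdefault(root, len(roots))
    return pi


# Instance (0, 0) has 3 tokens, instance (1, 0) has 2; tokens 1-2 of the
# first coincide with tokens 0-1 of the second.
pi = fuse_nodes({(0, 0): 3, (1, 0): 2}, [[(0, 0, (1, 2)), (1, 0, (0, 1))]])
print(len(set(pi.values())))  # K, the number of fused nodes
```

Given `pi`, the fused potentials of Equation 8 are obtained by summing the original log-potentials of all nodes (and edges) that map to the same fused component.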



2.2 Training algorithm 

The overall training objective of Equation 6 is not necessarily concave in $w_s$ because of the agreement term
with sums within a log. As in [17] [11], it is easy to derive a variational approximation with extra variables
to be solved using an EM algorithm. EM will give a local optimum if the marginals of the $P_A$ distribution can
be computed exactly. Since this cannot be guaranteed for general fused graphs, we also explore the simpler
approach of gradient ascent. In Section 5 we show that gradient ascent achieves better accuracy than EM while
being faster. The gradient of $LL(\mathcal{Y}_A, W)$ is

$$\nabla LL(\mathcal{Y}_A, W) = \sum_{s,i,c} \sum_{y_c} \big(\mu_{A,\pi(s,i,c)}(y_c) - \mu_{s,c}(y_c \mid X_{s,i})\big)\, f_s(X_{s,i}, y_c, c)$$

where we use the notation $\mu_{s,c}$ and $\mu_{A,c'}$ to denote the marginal probability at $c$ of $P_s$ and at $c'$ of $P_A$ respectively. Note
that the E-step of EM requires the computation of the same kind of marginal variables. These are computed
using the same inference algorithms as used to compute the log partition function and we discuss the various
options next.
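
As a brief sketch of where this gradient comes from, assume (as is standard for such models, though not stated explicitly here) that the features decompose over components, $f_s(x, y) = \sum_c f_s(x, y_c, c)$. Differentiating the two terms of Equation 7 with respect to a single $w_s$, and using the fact that the derivative of a log partition function is an expected feature vector, gives:

```latex
\begin{align*}
\nabla_{w_s} \log Z_A(\theta)
  &= \sum_{i,c} \sum_{y_c} \mu_{A,\pi(s,i,c)}(y_c)\, f_s(X_{s,i}, y_c, c), \\
\nabla_{w_s} \sum_{i} \log Z(X_{s,i}, w_s)
  &= \sum_{i,c} \sum_{y_c} \mu_{s,c}(y_c \mid X_{s,i})\, f_s(X_{s,i}, y_c, c).
\end{align*}
```

Subtracting the second line from the first yields the per-source block of the gradient: the difference between the feature expectations under the fused model and under the base model.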



2.3 Approximations 

We explore two categories of approximations for training when $LL(\mathcal{Y}_A, W)$ is intractable.

2.3.1 Partitioning the agreement set 

The first category is based on approximating the $\Pr(\mathcal{Y}_A)$ distribution with a product of simpler distributions
obtained by partitioning the set $A$. We partition the agreement set $A$ into smaller subsets $A_1, \ldots, A_R$ such that
each $P(\mathcal{Y}_{A_k})$ is easy to compute and $\bigcap_k \mathcal{Y}_{A_k} = \mathcal{Y}_A$. We then approximate $\Pr(\mathcal{Y}_A)$ by $\prod_k P(\mathcal{Y}_{A_k})$, thus replacing
the corresponding log-likelihood term by $\sum_k LL(\mathcal{Y}_{A_k}, W)$. We explore three such partitionings:

Clique Agreement In this case we have one partition per clique $C \in A$. $G_A$ now decomposes into several
simpler graphs, where a simple graph has its nodes fused only at one clique. Figures 1(c)-(f) illustrate this
decomposition for the fused model of Figure 1(b). The probability $\Pr(\mathcal{Y}_{\{C\}})$ of agreement on members of a
single clique $C$ simplifies to

$$\Pr(\mathcal{Y}_{\{C\}}) = \sum_{y \in \mathcal{Y}_C} \prod_{(s, i, r) \in C} P_s(Y_{s,i,r} = y \mid X_{s,i}) \tag{10}$$

where $\mathcal{Y}_C$ is the set of all possible labelings for any member of $C$, and $P_s(Y_{s,i,r} = y)$ is the marginal probability of
the part $r$ taking the labeling $y$ under $P_s$.
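
Equation 10 is straightforward to evaluate once the part marginals are available. The sketch below assumes the marginals have already been computed by each base model (for example with forward-backward) and are passed in as dictionaries; the function names are illustrative.

```python
import math


def clique_agreement_log_prob(marginals):
    """log Pr(Y_{C}) for one clique, following Equation 10.

    marginals: one dict per clique member (s, i, r), mapping each candidate
               labeling y of the shared part to P_s(Y_{s,i,r} = y | X_{s,i}).
    """
    labelings = set().union(*marginals)
    total = sum(math.prod(m.get(y, 0.0) for m in marginals) for y in labelings)
    return math.log(total)


def decomposed_agreement(per_clique_marginals):
    """Additive agreement objective: sum of log Pr(Y_{C}) over all cliques."""
    return sum(clique_agreement_log_prob(m) for m in per_clique_marginals)


# Three members of one clique, two candidate labelings of the shared part.
members = [{"A": 0.9, "B": 0.1}, {"A": 0.8, "B": 0.2}, {"A": 0.5, "B": 0.5}]
# 0.9 * 0.8 * 0.5 + 0.1 * 0.2 * 0.5, i.e. approximately 0.37.
print(math.exp(clique_agreement_log_prob(members)))
```

Because the objective is a sum over cliques, a clique whose members genuinely disagree contributes one bounded term rather than corrupting a shared fused graph.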

This approximation is useful for two reasons. First, if the base models are sequences (e.g. in typical extraction
tasks) and clique parts $r$ are over contiguous positions in the sequence, the fused graph of $\Pr(\mathcal{Y}_{\{C\}})$ is always a
tree, such as the ones in Figures 1(c)-(f). Second, since for trees we can use sum-product to compute $\Pr(\mathcal{Y}_{\{C\}})$
instead of Equation 10, we can now use arbitrarily long cliques, instead of choosing unigram cliques, which is
typically the norm in extraction applications.

Node Agreement We also consider a special case of the clique agreement approximation, called node
agreement, in which each partition corresponds to agreement over a single variable as in Figure 1(e).

Instance Pair Agreement Another decomposition is based on picking pairs of instances and defining an
agreement set on all cliques which they share. For the example in Figure 1, graphs marked (g), (h), and (i)
demonstrate the fused graphs arising out of instance pair agreement. This scheme is expected to be useful
when base models exhibit strong edge potentials. However, unlike for the above two decompositions, there is no
guarantee that the fused graph is a tree (e.g. graph (g)). So, approximate inference may be required for some
pairs.

2.3.2 Approximating $Z_A(\theta)$

An alternate way to approximate $LL(\mathcal{Y}_A, W)$ is to stick with the fused model but approximate the computation
of $Z_A(\theta)$. We consider two options:

Full BP In general, any available sum-product inference algorithm like Belief Propagation and its convergent
tree reweighted versions [20] [14] can be used for approximating $Z_A(\theta)$. However, these typically require multiple
iterations and can sometimes be slow to converge.

OneStep TRW [17] propose a one-step approximation that reduces to a single step of the Tree reweighted
(TRW) family of algorithms [T3], where the roles of trees are played by individual instances. As in all TRW
algorithms, this method guarantees that the log partition value it returns is an upper bound, but for maximization
problems upper bounds are not very useful.

A downside of these approaches is that there is no guarantee that the approximation leads to a valid
probability distribution. For example, we often observed that the approximate value of $\log Z_A(\theta)$ was greater than
$\sum_{s,i} \log Z(X_{s,i}, w_s)$, causing the probability of agreement to be greater than 1.

To summarize, we would ideally like to optimize the agreement-based objective in Equation 6 exactly by
working with the equivalent fused graphical model of Equation 9. Due to intractability, we discussed various
ways to decompose the agreement term or approximate the corresponding fused model. As we shall show in
Section 5, when there are noisy cliques in the agreement set, the tractable decompositions turn out to be much
more robust than methods that approximate the fused model created from erroneous cliques.

3 Generating the agreement set 

In this section we discuss our unsupervised strategy for finding agreement sets. But first we stress that the
importance of this step cannot be overstated. As we show in Section 5, even the best collective training schemes
are only as good as their agreement set. This has interesting parallels with other learning tasks, e.g.
semi-supervised learning, where recent work has shown the importance of creating good neighborhood graphs [13].

Traditional collective extraction methods have not focused on the process of finding quality agreement sets.
These methods usually form a clique from arbitrary repetitions of unigrams [21] [8] [15]. This is inadequate
for two reasons. First, any strong first order dependencies cannot be transferred with only unigram
cliques. Second, blindly marking repetitions of a token/n-gram as a clique can inject a lot of noise in the
agreement set.

Instead we use a more principled strategy. We make the working assumption that significant content overlap 
among sources is caused by (approximate-)duplication of instances. So we assume that each instance has a 
hidden variable with value equal to its 'canonical instance value'. Instances inside a source will have different 
values of this variable (as duplicates are rare inside a source), whereas these values will be shared across sources, 
thus forming clusters. Assume for now that these clusters are known. Given such a cluster of deemed duplicates, 
we find maximally long segments that repeat among the instances in the cluster, and add one clique per such 
segment to the agreement set. Segment repetitions outside the cluster are considered as false matches and 
ignored. 
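
The segment-finding step can be illustrated for a single pair of instances. The sketch below finds maximal contiguous token runs shared by two token sequences using a simple dynamic program; it is a pairwise simplification, whereas the strategy described above works over a whole cluster of deemed duplicates. The function name and the `min_len` parameter are assumptions for illustration.

```python
def maximal_common_segments(a, b, min_len=1):
    """Return (start_a, start_b, length) for every maximal contiguous match.

    a, b: lists of tokens. A match is maximal if it cannot be extended by
    one more token on either side.
    """
    # run[i][j] = length of the common run ending at a[i-1] and b[j-1].
    run = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1

    segments = []
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            n = run[i][j]
            # Skip runs that continue to the right; left-maximality is
            # implied because n already counts the whole run so far.
            extends = i < len(a) and j < len(b) and a[i] == b[j]
            if n >= min_len and not extends:
                segments.append((i - n, j - n, n))
    return segments


x = "Matt Groening created The Simpsons".split()
y = "The Simpsons by Matt Groening".split()
print(maximal_common_segments(x, y))  # [(0, 3, 2), (3, 0, 2)]
```

Each returned triple would correspond to one candidate clique part per instance; restricting the search to instances inside the same cluster is what filters out coincidental repetitions.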

The task of optimally computing the clusters essentially reduces to the NP-hard multi-partite matching 
problem with suitably defined edge-weights. We tackle this by employing the following staged scheme: First, 
we order the sources using a natural criterion such as average pairwise similarity with the other sources. Each
instance in the first source forms a singleton cluster. In stage s, we find a bipartite matching between source 
s + 1 and the clusters formed by the first s sources. An instance i in source s + 1 will be assigned to the cluster 
to which it is matched. Unmatched instances form new singleton clusters. The edge-weight between an instance 
i and a cluster is defined as the best similarity of i with any member instance of the cluster. 
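
The staged scheme can be sketched as follows. Two simplifications are made and should be read as assumptions: a greedy highest-weight-first matching is swapped in for an optimal bipartite matching, and a similarity `threshold` (not mentioned in the text) decides when an instance stays unmatched. The Jaccard token similarity is likewise only a placeholder for whatever similarity measure is actually used.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings (placeholder measure)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def staged_clustering(sources, similarity=jaccard, threshold=0.5):
    """Cluster instances across ordered sources, one source per stage.

    sources: list of lists of instances, already ordered.
    Returns a list of clusters, each a list of (source_index, instance).
    """
    clusters = [[(0, inst)] for inst in sources[0]]
    for s, instances in enumerate(sources[1:], start=1):
        # Edge weight: best similarity of the instance with any cluster member.
        edges = []
        for i, inst in enumerate(instances):
            for c, cluster in enumerate(clusters):
                w = max(similarity(inst, member) for _, member in cluster)
                if w >= threshold:
                    edges.append((w, i, c))
        # Greedy one-to-one matching, heaviest edges first.
        edges.sort(key=lambda e: -e[0])
        used_i, used_c, assigned = set(), set(), {}
        for w, i, c in edges:
            if i not in used_i and c not in used_c:
                used_i.add(i)
                used_c.add(c)
                assigned[i] = c
        for i, inst in enumerate(instances):
            if i in assigned:
                clusters[assigned[i]].append((s, inst))
            else:
                clusters.append([(s, inst)])  # unmatched: new singleton
    return clusters


lists = [
    ["Matt Groening The Simpsons", "James Cagney White Heat"],
    ["The Simpsons by Matt Groening", "Public Enemy"],
    ["White Heat James Cagney 1949"],
]
for cluster in staged_clustering(lists):
    print(cluster)
```

Because each stage is a one-to-one matching, a cluster never receives two instances from the same source, which mirrors the assumption that duplicates are rare inside a source.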

When our assumption of instance duplication does not hold, say when each instance is an arbitrary natural 
language sentence, the bipartite matching scores will be low and we revert to the conventional clique generation 
scheme. As we shall see in Section 5, our strategy generates much better agreement cliques in practice.

4 Relationship with other approaches 

We now review various approaches relevant to collective training with partially overlapping sources. We omit 
agreement-based learning as it has already been discussed in Sections 1 and 2.3.

4.1 Posterior regularization (PR) 

The PR framework [10] trains a model with task-specific linear constraints on the posterior. The constrained
optimization problem is solved via the EM algorithm on its variational form. PR has been shown to have
interesting relationships with similar frameworks [16] [19] [8].

The aspect of PR most relevant to us is its application to multi-view learning [11]. There, the PR constraints
translate to minimizing the Bhattacharyya distance between the various posteriors. This has two key differences
with our setting. First, their agreement set is at the level of full instances instead of arbitrary sub-parts. Moreover,
their agreement set has no noise because the instances across views are known duplicates instead of assumed
ones as in ours. The second and more interesting difference is that of the agreement term.

Assuming that we have only two sources $s$ and $s'$, with only one shared clique $c$, training the two
models is the same as learning in the presence of two views of $c$. The agreement term under PR would be
$\log \sum_{y_c} \sqrt{P_s(y_c) P_{s'}(y_c)}$, where $P_s(y_c)$ is the marginal of $c$. This is maximized when the two marginals are
identical. In contrast, our agreement term of $\log \sum_{y_c} P_s(y_c) P_{s'}(y_c)$ is maximized when the marginals are
identical and peaked. If both the base models are strong, their marginals will be almost peaked, resulting in little



Figure 2: Comparison of the agreement (left) and multi-view (right) losses over two binomial posteriors (axes: p1, p2)



difference between the two terms. But a difference arises in the asymmetric case when one model is strong and
peaked and the other is weak and flat. One possible maximum for the two-view term would be the strong model
flattening out and becoming identical to the weaker one. As our agreement term is averse to flat marginals, it will
avoid this maximum. Figure 2 illustrates the difference between the two terms for two binomial distributions.
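
The contrast between the two terms can be checked numerically for two binomial posteriors with success probabilities `p1` and `p2`. This is a small sketch for illustration only; the function names are assumptions.

```python
import math


def agreement_term(p1, p2):
    """log sum_y P_s(y) P_s'(y) for two binomial posteriors."""
    return math.log(p1 * p2 + (1 - p1) * (1 - p2))


def multiview_term(p1, p2):
    """log sum_y sqrt(P_s(y) P_s'(y)), the Bhattacharyya-based term."""
    return math.log(math.sqrt(p1 * p2) + math.sqrt((1 - p1) * (1 - p2)))


# Identical but flat marginals: the multi-view term is already at its
# maximum of 0, while the agreement term is not.
print(multiview_term(0.5, 0.5), agreement_term(0.5, 0.5))

# Identical and peaked marginals: both terms are close to 0.
print(multiview_term(0.99, 0.99), agreement_term(0.99, 0.99))

# Asymmetric case, strong model (0.99) versus weak model (0.5): flattening
# the strong model to 0.5 raises the multi-view term but leaves the
# agreement term unchanged at log 0.5.
print(multiview_term(0.99, 0.5), multiview_term(0.5, 0.5))
print(agreement_term(0.99, 0.5), agreement_term(0.5, 0.5))
```

In the asymmetric case the multi-view term therefore rewards the strong model for becoming flat, whereas the agreement term only improves when the weak model becomes peaked in the same direction.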

Given such a relationship between the two terms, we compare the performance of our algorithms with the
multi-view algorithm in Section 5. The multi-view objective will be optimized via EM because the gradient
cannot be computed tractably for structured models.



4.2 Label transfer 

Another set of approaches deals with transferring labeled data from one source to the other. One such inexpensive
approach is an asymmetric staged strategy of training the model for a more confident source first, transferring
its certain labels to the next source, and so on. This requires a good ordering of sources to control error cascades.
In Section 5 we show that even with suitable heuristics, this scheme suffers from huge performance deviations.
Similar label transfer ideas have been employed in training rule-based extraction wrappers [S].

More sophisticated methods in this class include CoBoosting 0, Co-Training [1], and the two-view
Perceptron [2], which train two models in tandem by each model providing labeled data for the other. A detailed
comparison of these models in [10] shows that these methods are less robust than methods that jointly train all
models.

4.3 Inference-only approaches 

Another option is to only train the base models, and perform any corrections at runtime through collective
inference. Such strategies have been used on a variety of NLP tasks [3TJ[51[TS]. These methods usually end
up using cliques only over unigrams, with little focus on controlling their noise. The most common practice is
marking arbitrary repetitions of a token as a clique. As we show in Section 5, our collective training algorithms
are significantly better than collective inference, even with identical agreement sets. A prime limitation of
inference-only approaches is that they cannot transfer the benefits of overlap to other instances which do not
overlap.



5 Experimental evaluation 

We present extensive experiments over several real datasets covering a rich diversity of data characteristics. Our
first set of experiments seeks to justify collective training by showing substantial benefits over base models, and
alternatives like staged training and collective inference discussed in Sections 4.2 and 4.3 respectively. Second, we
study our collective training approach in detail by comparing the accuracies of the various approximations made
in Section 2.3. Third, we demonstrate the importance of choosing high quality agreement sets by comparing
various set-generation schemes. Finally, we make a case that our simple gradient ascent algorithm is as accurate
as existing traditional EM-based approaches [17] [11] while being considerably faster.

Datasets: We use a corpus of 58 real datasets, each comprising multiple HTML lists. All lists in a dataset
contain semi-structured instances relevant to a dataset-specific relation, e.g. University mottos, Caldecott medal
winners, movies by James Cagney, Supreme court cases etc. The 58 datasets exhibit a wide spectrum of
behavior in terms of their base accuracy, number of sources, number of cliques per instance, their noise levels,
and so on. For ease of presentation, we partition these 58 datasets into ten groups by a paired criterion: base
accuracy and relative size of the agreement sets. We create five bins for base accuracy values: 50-60, 60-70, and
so on, and two bins for agreement set: "M" (many) when there are more than 0.5 cliques per instance and "F"
(few) otherwise. Table 1 lists for each of the ten groups the number of datasets (#), average number of sources
($S$), number of labels ($|\mathcal{L}|$), number of cliques ($|A|$), instances, base F1 score, and noise in the agreement set
$A$. The last row in the table, which lists the standard deviation of these values over all 58 datasets, illustrates the
diversity of the datasets.

Figure 3: Comparing Clique Agreement against the Collective Inference and Staged approaches

Task For each dataset, we mimic a user query by seeding with a handful of structured records. These are used 
to generate labeled data out of matching instances in each list of the dataset. The goal is to learn a robust model 
for each list and extract more instances from it. All comparisons are with 3 and 7 seed records only. Bigger 
training sets are not practical in this task as the seed structured records are provided through a manual query. 
All our numbers are averaged over five random selections of the seed training set. Our base model is a conditional 
random field trained using standard context features over the neighborhood of a word, along with class prior 
and edge features. Our ground truth consists of every token manually labeled with a relevant dataset-specific 
label. Using this ground truth, we denote a clique as pure if all its members agree on their true labels, and noisy 
otherwise. We measure model accuracy by the F1 score of the extracted entities. We set $\lambda$ using a validation
set.
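
The notion of pure and noisy cliques can be made concrete with a small sketch that measures noise at both the segment (clique) level and the token-position level. The data layout and function names are assumptions for illustration; the labels in the example are made up.

```python
def clique_noise(cliques, true_labels):
    """Fraction of cliques whose members disagree on their true labels.

    cliques:     list of cliques; each is a list of (s, i, r) triples with
                 r a tuple of token positions.
    true_labels: dict (s, i) -> list of ground-truth labels per position.
    """
    def part_labels(s, i, r):
        return tuple(true_labels[(s, i)][p] for p in r)

    noisy = sum(1 for c in cliques if len({part_labels(*m) for m in c}) > 1)
    return noisy / len(cliques) if cliques else 0.0


def position_noise(cliques, true_labels):
    """Fraction of clique positions on which the members disagree."""
    total = noisy = 0
    for clique in cliques:
        width = len(clique[0][2])
        for k in range(width):
            labels = {true_labels[(s, i)][r[k]] for s, i, r in clique}
            total += 1
            noisy += len(labels) > 1
    return noisy / total if total else 0.0


truth = {
    (0, 0): ["Name", "Name", "Other"],
    (1, 0): ["Name", "Title", "Other"],
}
cliques = [
    [(0, 0, (0, 1)), (1, 0, (0, 1))],  # bigram clique, wrong at one position
    [(0, 0, (2,)), (1, 0, (2,))],      # unigram clique, pure
]
print(clique_noise(cliques, truth), position_noise(cliques, truth))
```

In this toy case one of two cliques is noisy, but only one of three positions is, which illustrates why position-level noise is lower than segment-level noise.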



5.1 Benefit of collective training 

We first compare collective training, collective inference, and staged label-transfer methods with the base model, 
starting with only three labeled instances from the user. We chose the Clique agreement method (Clique) for 
collective training, and used the same agreement set for the collective training and collective inference. 

Figure 3 shows the gains and losses of the three methods over the base model for each of the ten groups. Collective
training clearly performs the best, and its gains are especially large for datasets whose base accuracy is in the
60-80% range, and which have big agreement sets. Overall, its F1 is 87.9% in contrast to the base accuracy of
83.7%. Even with a training size of seven, F1 improves from 87.4 to 89.2 (not shown in the figure). In contrast,
the staged approach overall performs worse than Base and shows large swings in accuracy across datasets. It
is highly sensitive to the ordering of sources, and the hard label-transfer often causes error-propagation to all
downstream sources. Collective inference improves accuracy in a few cases but overall provides only a small gain
of 0.3% beyond Base.

Table 2: Comparing different training approximations in terms of F1 accuracy gain over the base model.



5.2 Comparing collective training objectives 

We now compare the various approximations of the agreement term: Clique, Node, Instance Pair, Full (Full
BP), and TR1 (OneStep TRW), as described in Section 2.3. Table 2 shows the gains in F1 for all the approaches
over the base model.

Observe that Clique and Node agreement are two of the best performing methods. We explore two
possible reasons for why they score over other approaches that fuse the influence of multiple cliques. One partial
explanation is that 15% of the cliques in our agreement set are noisy. In such a case, fused methods would
try hard at maximizing the likelihood of a wrongly fused graph. In contrast, the Clique and Node agreement
models decompose over cliques, so they can choose to ignore the terms corresponding to erroneous cliques during
optimization. A second reason common to all the losing approaches is the inexact nature of the optimization of
their training objectives. To understand which of these is a plausible reason, we remove all noisy cliques using
the ground truth and compare Clique and Instance Pair agreement. Accuracy improves by less than 0.6 F1
in both, and Clique continues to score over Instance Pair. This indicates that inexact gradient computation is
perhaps a major reason why more complex fused approaches perform worse.

We also see that there is little difference between the Clique and Node agreement models. While one possible
reason is the weakness of any first-order dependency in the true model, we find another interesting reason for
this behavior. We note that in a general n-gram agreement clique, only a few positions might be erroneous. For
example, the 15% noise measured at segment level reduces to 5.6% at position level. Since Node decomposes
the clique over positions, it can ignore wrong positions during optimization and be more robust against noisy
cliques. We corroborate this in Figure 4(a) where, for each of the 58 datasets, we plot the difference in F1 of
Clique and Node versus clique noise in the dataset's agreement set. We observe that whenever Clique performs
sufficiently worse than Node there is high noise in the cliques. In low noise settings, Clique is often much better
than Node.

Figure 4: Effect of clique noise on collective training



5.3 Noise in the Agreement Set 

As discussed in Section 3, it is important to choose high quality agreement sets. Figure 4(b) shows the F1 scores
of Node under three clique generation schemes of varying noise. The rightmost points are for the conventional
practice of choosing arbitrary unigram repetitions as cliques, which has a noise of 17.3% at position level. The
middle point is our method of clique generation, where we reduce the noise to 5.6%, and the leftmost are ideal
cliques with zero noise obtained by using the ground truth to remove all noisy unigrams. We find that our clique
selection method enjoys accuracy very close to that with noise-free cliques, and the accuracy with carelessly
chosen cliques is much lower.



5.4 Comparison with EM-based approaches 



In Section 4.1 we described how the PR framework [11] is applicable to our problem. We show its results in the
last column of Table 2. The accuracy of PR is comparable to the Clique method, showing that distance-based
and likelihood terms serve similar goals in our setting. However, the PR approach is more than four times slower
than our likelihood objective maximized using gradient ascent. The PR objective requires the EM algorithm for
training. In typical feature-based structured models, the M-step tends to be expensive and it is best not wasted
on working with fixed E-values. To evaluate the tradeoffs between EM and gradient-based training we also ran
the EM algorithm of [17], whose gradient-based version we call TR1 in Table 2. We found the EM trainer (not
shown) to have an F1 0.4% less than TR1 and also a factor of two slower.



6 Conclusion 

We presented a framework for jointly training multiple extraction models exploiting partial content overlap across 
sources. Partial overlap opens up a slew of problems — choosing a noise-free agreement set, a training objective 
or its approximation, and an optimization algorithm. We showed that while decomposing the agreement term 
over cliques provides a tractable yet accurate method of agreement, it also turns out to be more robust against 
clique noise than methods that approximate the fused graph. We also presented a strategy for computing clean 
agreement sets that is far superior to the naive alternative. Through extensive experiments on various real 
datasets we showed that our agreement-term decompositions on cliques and positions are more robust, accurate, 
and faster than alternatives like multi-view learning. 